Toxicity Policies
Overview
Toxicity policies detect toxic content in user inputs or model responses. DynamoGuard’s default toxicity policy uses the following taxonomy (an illustrative sketch of a policy check appears after the list):
- The following behaviors and topics are considered ‘toxic’:
  - Violent or hurtful topics
  - Hateful, derogatory, or discriminatory speech
  - Sexual references, including but not limited to sexual innuendos, acts, devices, and favors
  - Illegal or criminal activities
  - Encouragement of, or methods for, self-harm or suicide
  - Harassment or abusive behavior
  - Sexist or racist discourse
  - Implicit or explicit discrimination or stereotyping
  - Dangerous, unethical, or harmful topics
  - Creation, procurement, or operation of weapons
  - Procurement or use of harmful substances
  - Toxic, abusive, or offensive behavior
  - Biased, unfair, or ignorant remarks
  - Untruthful or misleading statements
  - Malice or manipulation
  - Vulgar or offensive language
- The following behaviors and topics are not considered ‘toxic’:
  - General, harmless queries
  - Responsible information on violence and discrimination
  - Responsible sexual education, health, or consent
  - Factual resources for mental health
  - Resources for managing conflicts and reporting harassment
  - Promotion of diversity, fairness, and inclusion
  - Crime prevention
  - Responsible weapon ownership
  - Responsible, harmless, and safe information on substances
  - Explanations of ethical and responsible behavior
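Conceptually, the policy maps each piece of content to zero or more of the toxic categories above. The following is a minimal, self-contained Python sketch of that flow; the category names, the `ToxicityVerdict` type, and the keyword-based `classify_toxicity` placeholder are all illustrative assumptions, not DynamoGuard’s actual labels or API. A real deployment would call the DynamoGuard service instead of the placeholder classifier.

```python
from dataclasses import dataclass

# Hypothetical category labels mirroring the taxonomy above; DynamoGuard's
# internal label set is not documented here, so these names are assumptions.
TOXIC_CATEGORIES = [
    "violence",
    "hate speech",
    "sexual content",
    "illegal activity",
    "self harm",
    "harassment",
    "discrimination",
    "weapons",
    "harmful substances",
    "profanity",
]

@dataclass
class ToxicityVerdict:
    toxic: bool           # overall decision for the content
    categories: list[str] # which taxonomy categories were matched
    score: float          # classifier confidence, 0.0 to 1.0

def classify_toxicity(text: str) -> ToxicityVerdict:
    """Placeholder classifier standing in for the policy's actual model.

    This keyword check exists only so the example runs end to end; a real
    deployment would delegate this decision to the DynamoGuard service.
    """
    matches = [c for c in TOXIC_CATEGORIES if c in text.lower()]
    return ToxicityVerdict(
        toxic=bool(matches),
        categories=matches,
        score=1.0 if matches else 0.0,
    )

# Benign queries (e.g., factual mental-health resources) should pass.
print(classify_toxicity("Where can I find factual resources for mental health?"))
```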
Toxicity Policy Actions
Toxicity policies currently support two actions: flagging and blocking content. A sketch of how an application might apply each action follows the list.
- Flag: allow a user input or model output containing toxic content to pass through, but flag it for review in the moderator view
- Block: prevent a user input or model output containing toxic content from being delivered
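Below is a minimal sketch of how an application layer might enforce these two actions. The `PolicyAction` enum, the in-memory `moderation_queue`, and `apply_toxicity_action` are hypothetical names used for illustration; DynamoGuard’s own enforcement mechanism is not shown here.

```python
from enum import Enum

class PolicyAction(Enum):
    FLAG = "flag"    # allow the content through, but surface it to moderators
    BLOCK = "block"  # withhold the content entirely

# Hypothetical in-memory moderator queue; a real deployment would persist
# flagged items so they appear in the moderator view.
moderation_queue: list[dict] = []

def apply_toxicity_action(text: str, is_toxic: bool, action: PolicyAction) -> str | None:
    """Return the text to deliver, or None when the policy blocks it."""
    if not is_toxic:
        return text
    if action is PolicyAction.FLAG:
        moderation_queue.append({"text": text})
        return text  # delivered, but flagged for moderator review
    return None      # PolicyAction.BLOCK: content is withheld

# A flagged input still reaches its destination; a blocked one does not.
assert apply_toxicity_action("hello", False, PolicyAction.BLOCK) == "hello"
assert apply_toxicity_action("toxic text", True, PolicyAction.FLAG) == "toxic text"
assert apply_toxicity_action("toxic text", True, PolicyAction.BLOCK) is None
```

The design point this illustrates is that Flag is non-blocking: flagged content still reaches its destination, and the record in the moderator view enables after-the-fact review, while Block is the only action that interrupts delivery.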